Abstract

Given a set of detections, detected at each time instant independently, we investigate how to associate them
across time. This is done by propagating labels on a set of graphs, each graph capturing how either the
spatio-temporal or the appearance cues promote the assignment of identical or distinct labels to a pair of
detections. The graph construction is motivated by a locally linear embedding of the detection features.
Interestingly, the neighborhood of a node in the appearance graph is defined to include all the nodes for which
the appearance feature is available (even if they are temporally distant). This gives our framework the uncommon
ability to exploit appearance features that are available only sporadically. Once the graphs have been defined,
multi-object tracking is formulated as the problem of finding a label assignment that is consistent with the
constraints captured by each graph, which results in a difference of convex (DC) program. We propose to
decompose the global objective function into node-wise sub-problems. This not only allows a computationally
efficient solution, but also supports an incremental and scalable construction of the graph, thereby making the
framework applicable to large graphs and practical tracking scenarios. Moreover, it opens the possibility of a
parallel implementation.
Discriminative and Efficient Label Propagation on Complementary Graphs for
Multi-Object Tracking


1 Introduction 


In this paper, we address the problem of multi-object
tracking (MOT). We assume that the targets
of interest have been detected at each time instant
[11], [28], [29] and that some appearance features have
been extracted. Given this error-prone information,
our objective is to link these detections into consistent
trajectories using a graph-based formalism.

Conventionally, a graph-based formalism assigns a
node to each detection. Edges are then defined to
connect the nodes, and each edge gets a cost that
reflects the dissimilarity between the two nodes it
connects. Afterwards, a K-shortest path algorithm
[12] is typically used to find the trajectories of the K
targets. Alternatively, other approaches use network
flow [37], maximum weighted independent set [25],
etc. to solve the same problem. These approaches
have proven effective in scenarios for
which the features are collected with the same level of
accuracy and reliability for each detection. With such
a stationary measurement process, the likelihood that
the detections along a path correspond to the same
physical object can be reasonably estimated based
on the accumulation of dissimilarities (similarities)
between (close to) consecutive nodes in the path. In
contrast, these approaches are not appropriate in cases
for which appearance features cannot be measured
with the same accuracy and reliability at every space and
time coordinate. Such cases are prevalent in many
real-life situations. For example, color histograms tend
to be noisy in the presence of occlusions. In some other
cases, highly discriminative features are available only
sporadically. This happens, for example, while imaging
biological cells under varying illuminations, each
illumination level highlighting certain features of the
cell. As another example, a digit printed on the jersey
of a player is available only when it faces the camera.
In such cases, the task of tracking multiple objects
while exploiting such features becomes non-trivial.

Two works have recently addressed this category 
of problems. In their formulation, the authors in pT| 
assume that a discrete set of L possible appearances
is known beforehand, which allows the creation of an
L-layered graph. In the l-th layer, running through
a node is penalized when the appearance of the
node is available and differs from the l-th presumed
appearance. Afterwards, a K-shortest path algorithm
is applied to find the K shortest paths across the
L layers. This method demonstrates that exploiting
sporadic features can significantly improve the tracking
performance. However, it is restricted to cases for
which the number and the appearance of the targets
are known a priori.

In contrast, [8] does not make any assumption about
the appearances of the targets or their number. It proposes a
widely applicable iterative hypothesis testing strategy
to exploit appearance features that are corrupted by
non-stationary noise or are only sporadically available.
In short, the authors iteratively consider each
node in the graph as a key-node, and investigate how
to link this key-node with other nodes in its neighborhood,
under the hypothesis that the appearance
observed in the key-node position is representative of
the actual appearance of the target. The method offers
the advantage of handling cases for which the discrete
set of possible appearances is not known a priori. The
greedy and iterative nature of the algorithm makes
it efficient from a computational and memory usage
perspective (no L-layered graph). Its main disadvantage
is that it is greedy and consequently does not
guarantee the global optimality of the solution.

In this paper, we extend our initial contribution
in [7]. We adopt a graph-based label propagation
framework. To this end, we construct a number of distinct
graphs, one for each appearance feature, apart
from the usual spatio-temporal graph. Additionally,
we also construct an exclusion graph to reflect the fact
that two detections that occur at the same time should
be assigned distinct labels. Hence, we construct
K + 2 'complementary' graphs (one spatio-temporal,
K appearance, one exclusion), where K ≪ L is the
number of appearance features. An example is shown
in Figure 1. In the case of a sports game, for example,
the jersey color and the digit printed on it can be
considered as two appearance features, and result in
two distinct appearance graphs.

During graph construction, a node is assigned to
each detection. For all the graphs but the exclusion
graph, edges connect pairs of nodes with a weight that
increases with the similarity between the nodes in
terms of space, time or appearance. The higher the
weight, the more likely the two nodes correspond to
the same physical target. Exceptionally, the edges of
the exclusion graph only connect nodes that cannot
belong to the same physical target. This is relevant,
for example, when the detections occur at the same
time.

Fig. 1. (a) An example with two targets (red and blue) with associated detections at each time (two trajectories
with 2x2 appearance measurements). Gray detections mean that no appearance feature is available. (b) Spatio-temporal
graph that depicts the spatio-temporal association between the nodes. (c) Appearance graph that connects nodes even
if they are far in time. (d) Exclusion graph in which edges connect nodes that coexist at the same time.

Given these graphs, the MOT problem is formulated as
finding a consistent label assignment, which means
that (i) the nodes that are sufficiently close in
space/time and/or appearance are labeled similarly,
and (ii) the nodes that co-exist at the same time
are labeled differently. The consistency of labeling is
measured by the labeling energy, which accumulates
the difference in the labels between a node and the other
nodes that are connected to it. Due to the definition
of weights in our graphs, a good labeling should
minimize the energy in the spatio-temporal and the
appearance graphs while maximizing the energy due
to the exclusion graph. Following our initial contribution
in [7], our paper formulates multi-object
tracking with sporadic appearance features as a labeling
problem in a number of complementary graphs.
In addition to [7], it also proposes:

• an efficient solution to the labeling problem,
splitting the 'big' problem into 'small' node-wise
problems that can be solved locally, optionally
based on a parallel implementation (Section 3),

• an extension of the local label propagation process
to handle incremental/on-line tracking scenarios
(Section 4).

Those two novel contributions make our proposed 
framework particularly suitable to practical scenarios. 

The rest of the paper is organized as follows. Section 2
formulates the MOT problem as a consistent
label assignment problem. Section 3 proposes the
solutions to the label assignment problem. A brief
review of the related work is presented in Section 5.
Experimental results are presented in Section 6.
Section 7 concludes our paper.

2 Tracking problem formulation 

This section first describes the construction of the
associated graphs. Afterwards, multi-object tracking
is formulated as a graph-consistent labeling problem.

2.1 Notation 

Vectors and matrices are denoted with bold lower-case
and upper-case symbols respectively, while scalar
values are denoted by light ones. Upper-case calligraphic
letters denote sets.


$K$                  number of appearance features
$x_i$                feature vector of the $i$-th sample
$\mathcal{N}_i$      set of neighbors of the $i$-th sample
$X^{(i)}$            features of the neighbors of the $i$-th sample, i.e., $X^{(i)} := \bigcup_{j \in \mathcal{N}_i} \{x_j\}$
$w_i^*$              optimal reconstruction weights for the $i$-th sample
$\mathcal{G}$        a graph with node set $\mathcal{V}$, edge set $\mathcal{E}$ and weights $W$
$n$                  $n = |\mathcal{V}|$, number of nodes
$L_l^{(+)}$          Laplacian of the $l$-th graph for $l \in \{0, 1, \dots, K\}$; $l = 0$ for the spatio-temporal graph
$L^{(-)}$            Laplacian of the exclusion graph
$y_i$                label distribution assigned to the $i$-th node
$m$                  size of the label vector
$Y$                  $Y = (y_1, y_2, \dots, y_n)^T$, label assignment matrix
$\Delta_d$           $\Delta_d = \{x \in \mathbb{R}_+^d : \mathbf{1}^T x = 1\}$, probability simplex of a given size $d$
$\mathcal{P}_{nm}$   set of all row-stochastic matrices of size $n \times m$

Parameters
$\gamma$             Scaling factor for time (Section 2.2)
$T$                  Connection window size for spatio-temporal graph (Section 2.2)
$v_{max}$            Maximum speed for gating constraint (Section 2.2)
$\alpha_l$           Weight assigned to the $l$-th labeling energy (Section 2.3)
$T_c$                Connection window size for appearance graph (Section 4.1)
$\sigma$             'Heat' parameter (Section 4.1)
                     Observation window for bounding complexity (Section 4.2)

TABLE 1
Notations


2.2 Graph construction 

We consider three distinct types of graphs. Hence,
each graph should be constructed separately. Nevertheless,
the constructions of the spatio-temporal and
appearance graphs follow the same approach, derived
from the locally linear embedding (LLE) technique
p7| . It assumes that data points can be accurately 
reconstructed by a weighted linear combination of
their local neighbors. We motivate the linearity assumption
by the fact that (i) target motion is linear in
a small temporal window, and (ii) appearance features
lie on a manifold. The number of neighbors is a design
parameter, and should be chosen according to the
kind of feature and the problem at hand.

In the following, we represent the feature of the
$i$-th detection by $x_i$ and that of its neighbors by
$X^{(i)} := \bigcup_{j \in \mathcal{N}_i} \{x_j\}$, where $\mathcal{N}_i$ is the set of neighbors of
$i$. Afterwards, the graph construction is formulated as
the problem of finding the reconstruction weights $w_i^*$
that solve the following optimization problem

$$w_i^* = \arg\min_{w_i \in \Delta_m} \|x_i - X^{(i)} w_i\|_2^2 + \frac{\delta}{2} \|w_i\|_2^2, \qquad (1)$$

where $\Delta_m := \{w \in \mathbb{R}^m \mid w \succeq 0, \mathbf{1}^T w = 1\}$ is the
probability simplex of a given size $m$. The reason to
constrain the weights to belong to the simplex is that
it promotes weight vector sparsity. To see this, we
observe that the simplex constraint is equivalent to
enforcing non-negative weights with unit $\ell_1$-norm, and first
consider the case with $\delta = 0$ in Equation (1). When
minimizing a quadratic fidelity term such as the one in
the first term of the cost of Equation (1) under such an
$\ell_1$-norm constraint, the solution is generally restricted
to a low-dimensional facet of the unit $\ell_1$-ball [41],
[42], i.e., a domain where the solution is sparse. We
favor sparsity as it leads to an efficient optimization
in Section 3. Promoting too much sparsity is however
not desired. If a sample is similar to several other samples
(e.g., a feature occurs several times along the sequence
of detections), the sparse reconstruction selects
only one neighbor and ignores the rest. To mitigate
this limitation, we add a quadratic part $\frac{\delta}{2}\|w_i\|_2^2$, which
offers the additional advantage of making the problem
strongly convex, resulting in a unique $w_i^*$. This can
be seen as similar to an elastic net regularization in
the sense that the sparsity term is imposed by the
constraints. We use $\delta = 10^{-2}$ (see footnote 1).
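As an illustration of Equation (1), the following is a minimal sketch of how the reconstruction weights of one sample could be computed by projected gradient descent on the simplex. The function names `project_to_simplex` and `reconstruction_weights`, the step-size rule and the number of steps are assumptions introduced here for illustration; the text does not specify which solver is used for this problem.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based rule)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k passing this test defines the threshold
    return [max(x - theta, 0.0) for x in v]


def reconstruction_weights(x, neighbors, delta=1e-2, steps=2000):
    """Minimise ||x - sum_j w_j * neighbors[j]||^2 + (delta / 2) * ||w||^2 over the simplex.

    x is a feature vector (list of floats); neighbors is a list of feature vectors.
    """
    k, d = len(neighbors), len(x)
    w = [1.0 / k] * k  # start from uniform weights
    # Conservative step size: the gradient is Lipschitz with constant at most
    # 2 * trace(X^T X) + delta (an assumed, simple bound).
    gram_trace = sum(c * c for nb in neighbors for c in nb)
    step = 1.0 / (2.0 * gram_trace + delta)
    for _ in range(steps):
        residual = [sum(w[j] * neighbors[j][c] for j in range(k)) - x[c] for c in range(d)]
        grad = [2.0 * sum(residual[c] * neighbors[j][c] for c in range(d)) + delta * w[j]
                for j in range(k)]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(k)])
    return w


# A point lying midway between its first two neighbors; the third is far away.
weights = reconstruction_weights([5.0, 5.0], [[4.0, 4.0], [6.0, 6.0], [20.0, 0.0]])
print(weights)
```

In this toy configuration the two neighbors that linearly explain the sample receive almost all of the weight, while the distant neighbor gets a weight close to zero, which is the sparsity effect described above.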

Once the weights for each data point are computed,
we gather them into a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, W)$, where

- $\mathcal{V}$ is the set of nodes, with the $i$-th node corresponding
to the $i$-th detection. We denote the number
of nodes by $n = |\mathcal{V}|$.

- $\mathcal{E}$ defines the connectivity between the samples
such that an edge $(i, j)$ is created between
nodes $i$ and $j$ only when the weight
$w_i^*(j)$, resulting from Equation (1), is non-zero, i.e.,
$\mathcal{E} = \{(i, j) \mid w_i^*(j) > 0\}$.

- $W$ assigns a weight to each edge such that

$$W_{ij} = \begin{cases} w_i^*(j) & \text{if } (i, j) \in \mathcal{E}, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

Now, we explain the specific issues in the construc¬ 
tion of each graph. 

Spatio-temporal graph. In the case of the spatio-temporal
graph, $x_i$ is defined by the time instant $t_i$
and the location information $c_i$ (e.g., bounding box
of the detections). Hence, $x_i = (\gamma t_i, c_i)^T$, where $\gamma$
affects the relative importance of the time difference
compared to the location difference between the data
points. A non-zero $\gamma$ ensures that the prediction of the
position of a detection from its neighbors is consistent
with both the locations and the time-stamps of the neighbors,
assuming that the targets move at constant velocity
in a small temporal neighborhood. We use $\gamma = 3$
pixels/frame. Our experiments (Figure 5) show that
this choice has little impact on the performance.

1. The effect of choosing different $\delta$ is discussed in the supplementary
material.

The neighbors $\mathcal{N}_i$ are defined to be the samples
whose time indices fall within a small temporal window
of size $T$ without falling under the gating
constraint defined below to build the exclusion graph.
$T$ should be large enough to bridge local missed
detections, but also small enough so that the linear motion
assumption holds. We use $T = 10$ frames.

Appearance graph. In the case of the appearance graph,
$x_i$ corresponds to an appearance feature (e.g., color
histograms, etc.). Since a feature might occur only
sporadically, $\mathcal{N}_i$ is defined to include all the samples
except those that co-occur with the $i$-th sample and
those that do not have appearance features.

Exclusion graph. This graph captures the constraints
associated with the fact that some detections
cannot share the same labels. For example, two detections
that occur at the same time instant should
have different labels. This is usually referred to as
time exclusivity. This information is encoded by setting
$W_{ij} = 1$ if $t_i = t_j$. In addition, we can enforce the
spatial constraint that a target cannot have a
large spatial displacement over a short time interval.
We encode this gating constraint by setting $W_{ij} = 1$
if $\|c_i - c_j\|_2 > v_{max} |t_i - t_j|$, where $v_{max}$ is the maximum
speed of the target. Thus, $\mathcal{N}_i$ comprises the
detections that either co-exist with the $i$-th detection
or violate the gating constraint.
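The two exclusion rules above (time exclusivity and gating) can be sketched as follows. This is only an illustrative sketch: the function name `exclusion_weights` and the representation of detections as separate lists of time stamps and center coordinates are assumptions made here, not part of the original formulation.

```python
import math


def exclusion_weights(times, centers, v_max):
    """Return the 0/1 exclusion weight matrix for a list of detections.

    W[i][j] = 1 when detections i and j co-occur (same time stamp) or when
    their spatial distance exceeds v_max * |t_i - t_j| (gating violation).
    """
    n = len(times)
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a detection is never excluded from itself
            dt = abs(times[i] - times[j])
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(centers[i], centers[j])))
            if dt == 0 or dist > v_max * dt:
                W[i][j] = 1
    return W


# Four detections: two at t = 0, one at t = 1 and one at t = 5.
W = exclusion_weights([0, 0, 1, 5], [(0, 0), (10, 0), (1, 0), (100, 0)], v_max=3)
for row in W:
    print(row)
```

With a maximum speed of 3 pixels/frame, only the pair formed by the first and third detections is compatible with a single target in this example; every other pair is connected in the exclusion graph.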

2.3 Multi-object tracking as consistent labeling 
problem 

Given a set $\mathcal{V}$ of $n$ vertices (i.e., the detections or
the tracklets in a tracking scenario), we consider that
a label assignment $Y = (y_1, \dots, y_n)^T$ is defined to
assign an $m$-dimensional (see footnote 2) label distribution $y_i \in \Delta_m$
to the $i$-th node, where $\Delta_m$ is the $m$-dimensional
probability simplex. Each dimension of the label distribution
$y_i$ corresponds to a target. Formally, the $k$-th
dimension, $y_i(k)$, $k = 1, \dots, m$, can be interpreted
as the probability of the node $i$ being the $k$-th target.
Consequently, $Y$ is a row-stochastic matrix, with each
row summing to unity. Therefore, we write $Y \in \mathcal{P}_{nm}$,
where $\mathcal{P}_{nm}$ is the set of all row-stochastic matrices
of size $n \times m$. We consider a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, W)$
as explained earlier. This graph is assumed to assign
large positive weights to edges that connect vertices
that are likely to have similar labels (typically because
they are close in time and space, or because they
have similar appearance). In [24], a harmonic function
approach is introduced to measure the inconsistency
of the label assignment matrix $Y$ with respect to the
graph $\mathcal{G}$. Specifically, it measures the $\ell_2$-norm of the
difference between the labels assigned to nodes that
are connected in the graph $\mathcal{G}$, and the labeling energy,
also known as the harmonic energy [24], is defined as

$$E_L(Y) := \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} \|y_i - y_j\|_2^2 = \mathrm{Tr}(Y^T L Y), \qquad (3)$$

where Tr is the trace of a matrix and $L$ is the graph
Laplacian, defined as $L = D - W$, where $D$ is
a diagonal matrix whose $i$-th diagonal element is
$D_{ii} := \sum_j W_{ij}$. Due to the definition of weights
in our graphs, we have $D_{ii} = 1$. Therefore, $D$ is
an identity matrix. For a graph with non-negative
weights, i.e., $W_{ij} \geq 0$, $L$ is positive semi-definite and
consequently the labeling energy in Equation (3) is
convex in $Y$.

2. Ideally, m should be equal to the number of targets plus one
(for false positive). Since, in general, we do not know the number
of targets a priori, we set m = n, considering the worst case in
which each detection corresponds to a different target.
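The labeling energy can be checked numerically on a toy graph. The sketch below, with hypothetical helper names `pairwise_energy` and `trace_energy`, evaluates the pairwise form (with the conventional factor 1/2) and the trace form of the harmonic energy; it assumes a symmetric weight matrix, for which the two forms coincide exactly.

```python
def pairwise_energy(W, Y):
    """0.5 * sum_ij W[i][j] * ||y_i - y_j||^2."""
    n = len(Y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
            total += 0.5 * W[i][j] * sq_dist
    return total


def trace_energy(W, Y):
    """Tr(Y^T L Y) with L = D - W and D the diagonal matrix of row sums."""
    n, m = len(Y), len(Y[0])
    L = [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]
    return sum(Y[i][k] * L[i][j] * Y[j][k]
               for k in range(m) for i in range(n) for j in range(n))


# Symmetric toy graph with unit row sums (so that D is the identity).
W = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
Y_mixed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # third node labeled differently
Y_same = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]    # all nodes share one label

print(pairwise_energy(W, Y_mixed), trace_energy(W, Y_mixed))  # both 2.0
print(pairwise_energy(W, Y_same))                             # 0.0
```

A labeling that assigns the same label to all connected nodes has zero energy, whereas separating one node from its two neighbors costs the sum of the weights of the cut edges.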

In our framework, we have $K + 2$ distinct graphs.
As all the graphs have the same set of nodes, we
frequently refer to a graph by its Laplacian $L$ in the
sequel. We represent the exclusion graph by $L^{(-)}$, and the
other graphs by $L_l^{(+)}$, $l \in \{0, \dots, K\}$, where $l = 0$
corresponds to the spatio-temporal graph and $1 \leq l \leq K$
corresponds to the $l$-th appearance graph.
We explicitly introduce the minus (respectively, plus)
superscript to emphasize that we would like to maximize
(respectively, minimize) the labeling energy on
the corresponding graph.

Given the measure of labeling energy on each
graph, we want to define a label assignment $Y^*$
that minimizes the labeling energies due to $L_l^{(+)}$ and
maximizes the labeling energy due to $L^{(-)}$. Mathematically,
we have

$$Y^* := \arg\min_{Y \in \mathcal{P}_{nm}} \sum_{l=0}^{K} \alpha_l E_{L_l^{(+)}}(Y) - E_{L^{(-)}}(Y)
      = \arg\min_{Y \in \mathcal{P}_{nm}} E_{L_{eff}^{(+)}}(Y) - E_{L^{(-)}}(Y), \qquad (4)$$

where $L_{eff}^{(+)} := \sum_{l=0}^{K} \alpha_l L_l^{(+)}$, and $\alpha_l \geq 0$ weighs the
contribution of the labeling energy due to the $l$-th graph. The
choice of $\alpha_l$ depends on the scenario at hand, i.e.,
on the prior knowledge available about the relevance
of the features. For example, while tracking sports
players, the decrease in labeling energy associated with
the color graph is not of primary importance since
the players from the same team have similar colors.
Hence, detections sharing the same color might
correspond to distinct players/labels. In such a case,
it is meaningful to lower the weight assigned to
the color graph as compared to the spatio-temporal
graph. In other cases, for which a unique specific color
is assigned to each target, a large weight should be
assigned to the color graph to force the assignment of
distinct labels to detections having different colors.
Since $\alpha_l \geq 0$ and $L_l^{(+)}$ is positive semi-definite, $L_{eff}^{(+)}$
is also positive semi-definite. Given $Y^*$, the $i$-th node
is assigned the label that corresponds to the largest
entry in $y_i^*$.


3 Graph-consistent labels computation

In this section, we explain how to compute the solution
$Y^*$ of the problem defined in Equation (4).
First, we present a global label assignment solution,
based on difference of convex programming. Afterwards,
we introduce a node-wise optimization approach
to solve the problem efficiently.

3.1 Joint label assignment optimization 

Let us rewrite Equation (4) as

$$Y^* = \arg\min_{Y \in \mathcal{P}_{nm}} \mathrm{Tr}(Y^T L_{eff}^{(+)} Y) - \mathrm{Tr}(Y^T L^{(-)} Y)
     := \arg\min_{Y \in \mathcal{P}_{nm}} \left[ g(Y) := f(Y) - h(Y) \right]. \qquad (5)$$

As $L_{eff}^{(+)}$ and $L^{(-)}$ are positive semi-definite matrices,
both $f(Y) := \mathrm{Tr}(Y^T L_{eff}^{(+)} Y)$ and $h(Y) :=
\mathrm{Tr}(Y^T L^{(-)} Y)$ are convex in $Y$, whereas $g(Y)$ is non-convex.
Specifically, Equation (5) belongs to the family
of difference of convex (DC) problems, and an iterative
majorization-minimization algorithm can be used to
solve the problem [22], as presented in Algorithm 1.
Starting with a random label distribution $Y^{(1)} \in \mathcal{P}_{nm}$,
the algorithm iteratively linearizes $h(Y)$ around the
$k$-th iterate $Y^{(k)}$ and minimizes the resulting convex
function $f(Y) - \nabla h(Y^{(k)})^T Y$ using the projected
gradient method [38]. The number of iterations $T_{joint}$
depends on the convergence tolerance.

Algorithm 1 Joint label assignment optimization
Input
  Graph Laplacians: $\{L_l^{(+)}, l = 0, \dots, K\}$, $L^{(-)}$
  Scaling weights: $\{\alpha_l, l = 0, \dots, K\}$
  Number of iterations: $T_{joint}$
Output
  Label assignment matrix: $Y^*$
Procedure:
  Choose an initial solution $Y^{(1)} \in \mathcal{P}_{nm}$ randomly.
  For $k = 1, \dots, T_{joint}$
    Compute $\nabla h(Y^{(k)})$, the gradient of $h(Y)$ at $Y^{(k)}$
    Solve the convex optimization problem
      $Y^{(k+1)} \leftarrow \arg\min_{Y \in \mathcal{P}_{nm}} \left[ f(Y) - \nabla h(Y^{(k)})^T Y \right]$
    by the projected gradient method [38].
  End For
  Return $Y^* \leftarrow Y^{(T_{joint}+1)}$.


It is worth noting that the gradient of $\mathrm{Tr}(Y^T L Y)$
is $(L + L^T) Y$. Therefore, both $L$ and its transpose $L^T$
are considered identically during gradient descent.
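To make the linearization step of Algorithm 1 explicit, one iteration can be written out as below. This is a sketch derived from the gradient expression above and from the convexity of $h$; the step size $\eta$ and the operator $\Pi_{\mathcal{P}_{nm}}$ (projection of each row onto the simplex $\Delta_m$) are notation introduced here for illustration only.

```latex
% Gradient of the subtracted convex part at the current iterate
\nabla h(Y^{(k)}) = \bigl(L^{(-)} + L^{(-)T}\bigr)\, Y^{(k)}

% Since h is convex, its first-order expansion is a lower bound on h,
% which yields a convex majorizer of g that touches g at Y^{(k)}
g(Y) \;\le\; f(Y) - h(Y^{(k)})
      - \mathrm{Tr}\!\bigl(\nabla h(Y^{(k)})^T (Y - Y^{(k)})\bigr)

% One projected gradient step on the majorizer (repeated T_p times)
Y \;\leftarrow\; \Pi_{\mathcal{P}_{nm}}\!\Bigl( Y - \eta \bigl[
      \bigl(L_{eff}^{(+)} + L_{eff}^{(+)T}\bigr) Y - \nabla h(Y^{(k)}) \bigr] \Bigr)
```

Because the majorizer coincides with $g$ at $Y^{(k)}$ and upper-bounds it elsewhere, any decrease of the majorizer also decreases $g$, which is why the outer iteration is monotone.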

Complexity analysis: Since there are $n$ nodes, the
graph Laplacian is an $n \times n$ matrix. Each node is
assigned an $m$-dimensional label distribution. Consequently,
$Y$ is an $n \times m$ matrix. The projected gradient
method [38] performs a gradient descent step followed
by a projection step, $T_p$ times. Each step has a naive
complexity of $O(n^2 m)$, which can be improved to
$O(kmn)$ if the graph Laplacian is $k$-sparse. Thus, the
overall complexity is $O(n^2 m T_p T_{joint})$. The parameters
$T_p$ and $T_{joint}$ depend on a fixed tolerance value.

The main disadvantage of the above solution is 
that its computational complexity grows quadratically 
with the number of nodes. Therefore, it cannot scale to 
large graphs. Furthermore, it can only handle off-line 
tracking problems because the optimization problem 
formulation is based on the whole graph. 

In the sequel, we describe how to circumvent these
limitations based on a node-wise decomposition.

3.2 Node-wise label assignment optimization 

To address the complexity issue of the joint label
propagation algorithm, we adopt a node-wise decomposition
of the objective function. That is, instead
of solving a "big" and "global" optimization
problem, each node updates locally and sequentially
its label distribution to decrease the global objective.
The approach is similar to the Gauss-Seidel iteration
(or coordinate descent approach). The advantages of
such a decomposition are twofold. First, the computational
complexity is significantly reduced, making
the framework applicable to large graphs, potentially
based on a parallel implementation. Second, as we solve
the problem by iterating over the nodes, it becomes
possible to handle tracking problems for which the
graphs grow incrementally, as new detections are
gradually computed over time.

In the remainder of the section, we first explain
our proposed efficient and node-wise label propagation
solution, and derive the conditions under which
the global objective function monotonically decreases.
Afterwards, we introduce a strategy to scale up the
algorithm using a parallel implementation.


3.2.1 Node-wise decomposition 

In this section, we first generalize the energy in
Equation (3) by replacing the $\frac{1}{2}\|y_i - y_j\|_2^2$ term by a
convex and symmetric function $\phi(y_i, y_j)$. Afterwards,
we decompose the global optimization problem in
Equation (5) into a node-wise optimization problem
such that the high dimensional optimization problem
is turned into a sequence of small problems in each
node. In doing so, we derive the class of $\phi$ functions
that guarantees monotonic decrease of the objective
function (see footnote 3).

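Formally, replacing the $\ell_2$-norm by $\phi$ in Equation (3),
we write the objective function in Equation (5)
as

$$g(Y) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ \sum_{l=0}^{K} \alpha_l W_{ij}^{(l)} - W_{ij}^{(-)} \right] \phi(y_i, y_j), \qquad (6)$$

where we define $W_{ij}^{(eff)} := \sum_{l=0}^{K} \alpha_l W_{ij}^{(l)} - W_{ij}^{(-)}$. Denoting
$\bar{W}_{ij}^{(eff)} := W_{ij}^{(eff)} + W_{ji}^{(eff)}$, we then isolate the contribution
of the $p$-th node as

$$\begin{aligned}
g(Y) &= \sum_{j} W_{pj}^{(eff)} \phi(y_p, y_j) + \sum_{i \neq p} \sum_{j} W_{ij}^{(eff)} \phi(y_i, y_j) \\
     &= \sum_{j} \bar{W}_{pj}^{(eff)} \phi(y_p, y_j) + \sum_{i \neq p} \sum_{j \neq p} W_{ij}^{(eff)} \phi(y_i, y_j) \qquad (7) \\
     &= g_p(y_1, \dots, y_n) + \sum_{i \neq p} \sum_{j \neq p} W_{ij}^{(eff)} \phi(y_i, y_j), \qquad (8)
\end{aligned}$$

where we assume $\phi(y_i, y_i) = 0$ and
$\phi(y_i, y_j) = \phi(y_j, y_i)$ in Equation (7), and we
introduce $g_p(y_1, \dots, y_n) := \sum_{j} \bar{W}_{pj}^{(eff)} \phi(y_p, y_j)$ for
brevity in Equation (8).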

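Given $Y^{(k)} = (y_1^{(k)}, \dots, y_n^{(k)})^T \in \mathcal{P}_{nm}$, we choose
an index $p \in \{1, \dots, n\}$ and compute a new iterate
$Y^{(k+1)} = (y_1^{(k+1)}, \dots, y_n^{(k+1)})^T \in \mathcal{P}_{nm}$ that satisfies

$$y_i^{(k+1)} \begin{cases} = y_i^{(k)} & \text{if } i \neq p, \\ \in \arg\min_{y \in \Delta_m} g_p(y_1^{(k)}, \dots, y, \dots, y_n^{(k)}) & \text{if } i = p. \end{cases} \qquad (9)$$

Then, by construction,

$$\begin{aligned}
g(Y^{(k+1)}) &= g_p(Y^{(k+1)}) + \sum_{i \neq p} \sum_{j \neq p} W_{ij}^{(eff)} \phi(y_i^{(k+1)}, y_j^{(k+1)}) \\
             &\leq g_p(Y^{(k)}) + \sum_{i \neq p} \sum_{j \neq p} W_{ij}^{(eff)} \phi(y_i^{(k)}, y_j^{(k)}) \\
             &= g(Y^{(k)}).
\end{aligned}$$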

Therefore, we conclude that under the following assumptions:

• the loss function $\phi(\cdot, \cdot)$ is convex,

• the loss function is coincident (see footnote 4), i.e., $\phi(y_i, y_i) = 0$,

• and the loss function is symmetric with respect to
its arguments, i.e., $\phi(y_i, y_j) = \phi(y_j, y_i)$,

the optimization step at any fixed node $p$

$$y_p^{(k+1)} \in \arg\min_{y \in \Delta_m} g_p(y_1^{(k)}, \dots, y, \dots, y_n^{(k)})
             = \arg\min_{y \in \Delta_m} \sum_{j} \bar{W}_{pj}^{(eff)} \phi(y, y_j^{(k)}) \qquad (10)$$

monotonically decreases the objective function $g(Y)$.
Equation (10) is still a DC problem and it can be
solved by using the majorization-minimization technique,
as discussed in Section 3.1. It has to be noted that
when $\phi$ is chosen to be the $\ell_2$-norm, the above conditions
are satisfied.

The label propagation process is finally achieved by
sequentially updating the label distribution over the
nodes, possibly $T_{con} \geq 1$ times, until $g(Y^{(k)})$ does not
decrease any more. We summarize the overall process
in Algorithm 2. Note that we do not assume anything
about the structure of the graph, thereby allowing
loops in the graph.

3. Detailed derivation is provided in the supplementary material.

4. The coincidence property will make the loops irrelevant and
generally we do not need loops in the graph.


Algorithm 2 Node-wise label assignment algorithm
Input
  Weight matrices: $\{W^{(l)}, l = 0, \dots, K\}$, $W^{(-)}$
  Scaling weights: $\{\alpha_l, l = 0, \dots, K\}$
  Number of iterations: $T_{con}$
Output
  Label assignment matrix: $Y^*$
Procedure
  Set $W^{(eff)} \leftarrow \sum_l \alpha_l W^{(l)} - W^{(-)}$
  Set $\bar{W}^{(eff)} \leftarrow W^{(eff)} + W^{(eff)T}$
  Choose initial solution $Y^{(1)} \in \mathcal{P}_{nm}$
  Set $k \leftarrow 1$
  For $t = 1, \dots, T_{con}$
    Initialize $\mathcal{U} \leftarrow \mathcal{V}$
    While $\mathcal{U} \neq \emptyset$
      Select a node $p$ from $\mathcal{U}$
      Solve $y \leftarrow \arg\min_{y \in \Delta_m} \sum_j \bar{W}_{pj}^{(eff)} \phi(y, y_j^{(k)})$
      $Y^{(k+1)} \leftarrow (y_1^{(k)}, \dots, y, \dots, y_n^{(k)})^T$
      $\mathcal{U} \leftarrow \mathcal{U} \setminus \{p\}$
      $k \leftarrow k + 1$
    End While
  End For
  Return $Y^* \leftarrow Y^{(k)}$

Note: we have observed that the order in which $p$ is chosen from $\mathcal{U}$
does not affect the labeling energy much. Consequently, we chose
nodes in sequential order.
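The node-wise procedure can be sketched in a few lines for the special case $\phi(y_i, y_j) = \frac{1}{2}\|y_i - y_j\|_2^2$. Note the substitution made in this sketch: instead of the projected gradient inner solver mentioned in the text, each majorization step is solved in closed form by a simplex projection, which is only valid for this quadratic loss. All function names, the toy graph and the iteration counts are assumptions introduced for illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based rule)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def objective(W_eff, Y):
    """g(Y) = sum_ij W_eff[i][j] * 0.5 * ||y_i - y_j||^2."""
    n = len(Y)
    return sum(W_eff[i][j] * 0.5 * sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
               for i in range(n) for j in range(n))


def update_node(p, W_bar, Y, mm_steps=20):
    """Majorization-minimization on the local DC problem of node p."""
    m = len(Y[p])
    y = list(Y[p])
    pos = [(w, Y[j]) for j, w in enumerate(W_bar[p]) if w > 0 and j != p]   # attraction
    neg = [(-w, Y[j]) for j, w in enumerate(W_bar[p]) if w < 0 and j != p]  # exclusion
    s_pos = sum(w for w, _ in pos)
    for _ in range(mm_steps):
        # Linearize the subtracted convex part around the current y.
        grad_h = [sum(w * (y[c] - yj[c]) for w, yj in neg) for c in range(m)]
        target = [sum(w * yj[c] for w, yj in pos) + grad_h[c] for c in range(m)]
        if s_pos > 0:
            # The surrogate is 0.5 * s_pos * ||y - target / s_pos||^2 + const.
            y = project_to_simplex([t / s_pos for t in target])
        else:
            # Purely linear surrogate: its minimum over the simplex is a vertex.
            best = max(range(m), key=lambda c: target[c])
            y = [1.0 if c == best else 0.0 for c in range(m)]
    return y


def node_wise_labeling(W_eff, Y, sweeps=5):
    """Sequentially update every node, `sweeps` times; return labels and energies."""
    n = len(Y)
    W_bar = [[W_eff[i][j] + W_eff[j][i] for j in range(n)] for i in range(n)]
    Y = [list(row) for row in Y]
    history = [objective(W_eff, Y)]
    for _ in range(sweeps):
        for p in range(n):
            Y[p] = update_node(p, W_bar, Y)
        history.append(objective(W_eff, Y))
    return Y, history


# Toy example: nodes 0 and 1 co-occur, nodes 2 and 3 co-occur (exclusion, weight -1);
# node 0 is linked to node 2 and node 1 to node 3 in the spatio-temporal graph (+1).
W_eff = [[0, -1, 1, 0],
         [-1, 0, 0, 1],
         [1, 0, 0, -1],
         [0, 1, -1, 0]]
Y0 = [[0.6, 0.4], [0.4, 0.6], [0.5, 0.5], [0.5, 0.5]]
Y, history = node_wise_labeling(W_eff, Y0)
labels = [max(range(len(row)), key=row.__getitem__) for row in Y]
print(labels)   # [0, 1, 0, 1]
print(history)  # non-increasing sequence of objective values
```

In this example the slight initial preference of nodes 0 and 1 is enough to break the symmetry: linked detections end up with the same label, co-occurring detections with distinct labels, and the objective never increases from one sweep to the next.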


Complexity analysis: Each node solves an $m$-dimensional
DC program using the projected gradient
method. Let the number of iterations required for
the convergence of the projected gradient method be
$T_{pl}$, which is comparable to $T_p$ in Section 3.1. The
complexity of the DC optimization in a specific node
is therefore $O(m T_{pl})$. Since there are $n$ nodes and since
we traverse the nodes $T_{con}$ times, the overall complexity
is $O(m n T_{pl} T_{con})$. From experiments, we have seen
that $T_{con} \ll T_{joint}$. Compared with the complexity
of the joint approach, which is $O(n^2 m T_p T_{joint})$, the node-wise
decomposition approach yields an improvement
of $O(n T_{joint} / T_{con})$, which becomes significant as $n$
increases, making it a better choice for large-scale
problems as confirmed by our experiments.


3.2.2 Parallel implementation 

The node-wise decomposition of the objective function
also paves the way for a parallel implementation
of the label optimization. This allows our proposed
approach to scale up further with the size of the
graph. In this section, we first derive a condition under
which the parallelization of the coordinate descent
decreases the objective function.

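We denote the set of nodes for parallel descent by
$\mathcal{J}$ and its complement by $\bar{\mathcal{J}} := \mathcal{V} \setminus \mathcal{J}$. Then, we
decompose the objective function as (see footnote 5)

$$g(Y) = \sum_{i \in \mathcal{J}} g_i(Y) - \sum_{i \in \mathcal{J}} \sum_{j \in \mathcal{J}} W_{ij}^{(eff)} \phi(y_i, y_j)
       + \sum_{i \in \bar{\mathcal{J}}} \sum_{j \in \bar{\mathcal{J}}} W_{ij}^{(eff)} \phi(y_i, y_j). \qquad (11)$$

The negative terms in Equation (11) are called interference
terms. To nullify these terms, we pick the
nodes in $\mathcal{J}$ such that there are no edges between them,
i.e., $\forall (i, j) \in \mathcal{J} \times \mathcal{J}$, $W_{ij}^{(eff)} = 0$. Under this condition,
we can write

$$g(Y) = \sum_{i \in \mathcal{J}} g_i(Y) + \sum_{i \in \bar{\mathcal{J}}} \sum_{j \in \bar{\mathcal{J}}} W_{ij}^{(eff)} \phi(y_i, y_j) \qquad (12)$$

and solve the local optimization problem

$$y_i^{(k+1)} \in \arg\min_{y \in \Delta_m} g_i(y_1^{(k)}, \dots, y, \dots, y_n^{(k)}) \qquad (13)$$

in parallel for each node $i \in \mathcal{J}$. Then, the resulting
label assignment matrix $Y^{(k+1)}$, defined as

$$y_i^{(k+1)} \begin{cases} \in \arg\min_{y \in \Delta_m} g_i(y_1^{(k)}, \dots, y, \dots, y_n^{(k)}) & \text{if } i \in \mathcal{J}, \\ = y_i^{(k)} & \text{otherwise,} \end{cases}$$

monotonically decreases the objective function, i.e.,
$g(Y^{(k+1)}) \leq g(Y^{(k)})$. As a consequence, as long as the
nodes that are processed in parallel are not neighbors,
a monotonic decrease of the objective function is
guaranteed. In the experiments (Section 6), we demonstrate the benefit
of parallelization with a simple yet effective batch-based
scheduling approach.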
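The non-neighbor condition suggests grouping the nodes into batches that contain no mutually connected pair, so that all nodes of a batch can be updated concurrently. The following sketch uses a simple greedy grouping; it is an assumed scheduling rule given for illustration, since the batch-based scheduling used in the experiments is not detailed in this section, and the name `independent_batches` is hypothetical.

```python
def independent_batches(W_eff):
    """Greedily split the nodes into batches with no edge inside a batch.

    W_eff is the (possibly asymmetric) effective weight matrix; two nodes are
    considered neighbors when either W_eff[i][j] or W_eff[j][i] is non-zero.
    """
    batches = []
    for i in range(len(W_eff)):
        for batch in batches:
            if all(W_eff[i][j] == 0 and W_eff[j][i] == 0 for j in batch):
                batch.append(i)  # node i interferes with no member of this batch
                break
        else:
            batches.append([i])  # no compatible batch found: open a new one
    return batches


# Same toy graph as a 4-cycle: 0-1 and 2-3 are exclusion edges, 0-2 and 1-3 are links.
W_eff = [[0, -1, 1, 0],
         [-1, 0, 0, 1],
         [1, 0, 0, -1],
         [0, 1, -1, 0]]
print(independent_batches(W_eff))  # [[0, 3], [1, 2]]
```

Each batch can then be dispatched to parallel workers, with batches processed one after the other so that the interference terms of Equation (11) vanish within every parallel step.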

4 From off-line to incremental label propagation

In the previous sections, we described the off-line graph
construction and label propagation steps. However, in
many real-life applications, detections arrive progressively
over time. To handle such scenarios, while
staying as close as possible to the off-line formalism,
we embed the node-wise label propagation within
an incremental graph construction process. Once the
new detections arrive, the graph is incremented by
incorporating them. Afterwards, we re-optimize the
label distribution by iterating over the nodes using
the node-wise decomposition.

In the incremental graph construction, we do not
have access to future samples. Consequently, the
LLE-based graph construction of Section 2.2 cannot
be used. This has two implications. First, we need
to define an explicit strategy to gradually incorporate
new targets in the scene. Second, the implicit linear
motion model cannot be embedded while constructing
the spatio-temporal graph since future detection
locations are not known at construction time.

The remainder of the section first explains how
new detections are connected to the existing nodes.
It then describes how labels are propagated through
the incremented graph.

5. Detailed derivation is provided in the supplementary material.

4.1 Incremental Graph Construction 

We assume that the detections arrive sequentially
over time. Consider the set of detections at time $t$,
and let the graph up to time $t - 1$
be $\mathcal{G}^{(t-1)} = (\mathcal{V}^{(t-1)}, \mathcal{E}^{(t-1)}, W^{(t-1)})$. Since the graph
evolves with time, we denote the number of nodes
and the size of the label vector explicitly by $n^{(t)}$ and
$m^{(t)}$ respectively.

The incrementation differs depending on whether
we are dealing with the spatio-temporal graph, the
appearance graph(s) or the exclusion graph. In all
graphs, the new detections are first added to the set
of vertices $\mathcal{V}^{(t-1)}$ to generate $\mathcal{V}^{(t)}$. Edges and weights
are incremented separately for each graph as follows:

Exclusion graph. We create new edges between the nodes that occur at time $t$. We also create edges from the nodes at time $t$ to the existing previous nodes if they are not within the gating region. Each exclusion edge has a weight of 1.
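
The exclusion-graph update can be sketched as follows. This is only an illustration under stated assumptions: the function name `exclusion_edges` and the predicate `in_gate` are hypothetical, and the gating test itself (defined elsewhere in the paper) is passed in by the caller rather than implemented here.

```python
def exclusion_edges(new_nodes, old_nodes, in_gate):
    """Return the weight-1 exclusion edges created when the nodes of time t arrive.

    new_nodes / old_nodes are lists of hashable node ids. in_gate(u, v) is a
    hypothetical predicate telling whether old node v lies inside the gating
    region of new node u.
    """
    edges = {}
    # Detections observed at the same time instant cannot share a label.
    for a in range(len(new_nodes)):
        for b in range(a + 1, len(new_nodes)):
            edges[(new_nodes[a], new_nodes[b])] = 1.0
    # A new node also excludes every earlier node outside its gating region.
    for u in new_nodes:
        for v in old_nodes:
            if not in_gate(u, v):
                edges[(u, v)] = 1.0
    return edges
```

Passing the gate as a predicate keeps the sketch independent of the particular gating rule used in a given experiment.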

Spatio-temporal and appearance graphs. We connect each node at time $t$ with the nodes in a window $[t - T_c, t]$, where $T_c$ is the connection window size. A large $T_c$ results in dense graphs, whereas a small $T_c$ results in sparse graphs. Once the neighborhood is defined, we assign a weight $W_{ij}$ between a novel node $i$ and an existing node $j$ as

$$W_{ij} = \begin{cases} \exp\left(-\frac{1}{\sigma^2}\, d(x_i, x_j)^2\right) & \text{if } |t_i - t_j| < T_c, \\ 0 & \text{otherwise,} \end{cases} \tag{14}$$

where $t_i$ and $x_i$ denote the time instant and the features of the $i$-th node respectively, $d(\cdot,\cdot)$ measures the dissimilarity between the features $x_i$ and $x_j$, and $\sigma$ is a scaling parameter. The parameters $T_c$ and $\sigma$ are adapted to each kind of graph. In our experiments, $T_c$ is set to 10 frames for the spatio-temporal graph (as in the off-line graph construction), but is extended up to 200 frames in the appearance graph to bridge the gaps caused by the sporadic nature of the feature. The parameter $\sigma$ should be larger than the typical distance measured between the features of two detections corresponding to the same target, while being smaller than the typical distance measured between distinct targets. In practice, our values for $\sigma$ have been selected by looking at the two distributions of distances between pairs of detections that correspond to the same / different targets$^6$; specifically, we use $\sigma = 20$ in the spatio-temporal graph and $\sigma = 0.05$ in the appearance graph. Also, we use $d(\cdot,\cdot) := \|\cdot - \cdot\|_2$, but any other distance measure can be envisioned.
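
As a minimal sketch of the weight assignment described above, the function below evaluates a Gaussian-type kernel inside the connection window. It assumes the kernel has the form $\exp(-d(x_i,x_j)^2/\sigma^2)$ with the Euclidean distance; the function name `edge_weight` and the tuple-based feature representation are illustrative choices, not part of the original implementation.

```python
import math


def edge_weight(x_i, x_j, t_i, t_j, sigma, T_c):
    """Weight between a novel node i and an existing node j.

    Assumed form: exp(-d(x_i, x_j)^2 / sigma^2) if |t_i - t_j| < T_c, else 0,
    where d is the Euclidean distance between the feature tuples.
    """
    if abs(t_i - t_j) >= T_c:
        # Outside the connection window: no edge.
        return 0.0
    d = math.dist(x_i, x_j)
    return math.exp(-(d / sigma) ** 2)
```

For instance, with the values reported for the spatio-temporal graph ($\sigma = 20$, $T_c = 10$), two detections 20 units and 3 frames apart receive the weight $e^{-1}$.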

6. These distributions should ideally be derived from ground-truth data. When such ground truth is not available, we show in the supplementary material that a reasonable $\sigma$ can simply be inferred by comparing two distributions of distances measured between either neighboring or co-existing detections. Another alternative consists in building on reliable tracklets to identify pairs of detections corresponding to similar/distinct targets.


To account for the cases in which some detections (nodes) are likely to correspond to new targets, we introduce a virtual source node in the graph. This source node is connected to every node in the spatio-temporal graph. The weight of the edge connecting the source node to the $i$-th node is denoted by $w_i^{(s)}$. This weight depends on the prior knowledge we might have about where and/or when a target is likely to appear in the field of view. In our case, we consider that a new target appears either at the beginning of the tracking process, or when entering the scene at the borders of the image. Therefore, the weights should be large for the detections that are close to the image border and/or that appear at the beginning of the tracking. For the $i$-th detection, we compute the smallest distance $d_i^{(\min)}$ from the detection to the image border. Then, we compute $w_i^{(s)}$ by replacing $d(\cdot,\cdot)$ by $d_i^{(\min)}$ in Equation (14).

Note that when some prior knowledge is available about the appearance of the targets entering the scene, e.g., because the digits of the players sitting in the dug-out in team sport games are known, edges to the source node could be defined in the appearance graph as well. Once the weights are defined, they are normalized such that $\sum_{j \in \{\mathcal{N}_i \cup s\}} W_{ij} = 1$.
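
The source-node weight and the final normalization can be illustrated as below. This is a sketch under assumptions: the detection is reduced to a single image point (for example its box center), the kernel is assumed to be $\exp(-d^2/\sigma^2)$, and the names `source_weight` and `normalize_weights` are hypothetical.

```python
import math


def source_weight(center, image_size, sigma):
    """Weight of the edge between the virtual source and a detection.

    center is an (x, y) image point standing in for the detection and
    image_size is (width, height). The smallest distance to the image border
    replaces the feature distance in the assumed kernel exp(-d^2 / sigma^2).
    """
    x, y = center
    width, height = image_size
    d_min = min(x, y, width - x, height - y)
    return math.exp(-(d_min / sigma) ** 2)


def normalize_weights(weights):
    """Rescale the weights of one node (neighbors plus source) to sum to 1."""
    total = sum(weights.values())
    if total == 0:
        return dict(weights)
    return {node: w / total for node, w in weights.items()}
```

A detection on the border gets the largest possible source weight (1 before normalization), while a detection in the middle of the image gets a weight close to 0.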

4.2 Label propagation in the incremented graph 

After incrementing the graphs, we perform node-wise label propagation. We denote the label distribution over $G^{(t)}$ after $k$ iterations of the label propagation process by $Y^{(t,k)}$. Moreover, $Y^{(t)}$ denotes the label distribution after the convergence of the propagation process at time $t$. We first initialize the label distribution matrix at time $t$, denoted by $Y^{(t,1)}$, by augmenting the label distribution matrix at time $t-1$, denoted by an $n^{(t-1)} \times m^{(t-1)}$-dimensional matrix $Y^{(t-1)}$, as follows:

$$Y^{(t,1)} = \begin{bmatrix} \begin{matrix} Y^{(t-1)} & 0_{n^{(t-1)} \times |\mathcal{D}^{(t)}|} \end{matrix} \\ U^{(t)} \end{bmatrix} \tag{15}$$

where $0_{n^{(t-1)} \times |\mathcal{D}^{(t)}|}$ is an $n^{(t-1)} \times |\mathcal{D}^{(t)}|$-dimensional zero matrix and $U^{(t)}$ is a $|\mathcal{D}^{(t)}| \times (|\mathcal{D}^{(t)}| + m^{(t-1)})$-dimensional matrix such that $U_{ij} = 1/(|\mathcal{D}^{(t)}| + m^{(t-1)})$. Obviously, $U^{(t)}$ is a (uniform) row-stochastic matrix, and a uniform label distribution is assigned to the novel nodes.
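
The initialization step can be sketched with plain nested lists. The sketch assumes that each new detection adds one candidate label, so that the label dimension grows from $m^{(t-1)}$ to $m^{(t-1)} + |\mathcal{D}^{(t)}|$; the function name `init_labels` is illustrative.

```python
def init_labels(Y_prev, num_new):
    """Augment the previous label matrix when num_new detections arrive.

    Y_prev is a list of n_prev rows of length m_prev. Existing rows are padded
    with zeros for the num_new new labels, and each new node receives a
    uniform distribution over all m_prev + num_new labels.
    """
    m_prev = len(Y_prev[0]) if Y_prev else 0
    m_new = m_prev + num_new
    # Existing nodes keep their distribution; the new labels start at zero.
    old_rows = [list(row) + [0.0] * num_new for row in Y_prev]
    # New nodes start from a uniform (row-stochastic) distribution.
    new_rows = [[1.0 / m_new] * m_new for _ in range(num_new)]
    return old_rows + new_rows
```

Every row of the returned matrix still sums to one, so the result is a valid starting point on the probability simplex for the node-wise updates.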

After initialization, we iterate over all the nodes (except the virtual source node) and solve the node-wise optimization problem introduced in Section 3.2:

$$y_i^{(t,k+1)} \in \operatorname*{argmin}_{y \in \Delta^{m^{(t)}}} \; g_i\big(y;\, y_1^{(t,k)}, \ldots, y_{n^{(t)}}^{(t,k)}\big) \tag{16}$$


where $e_i \in \Delta^{m^{(t)}}$ is a singleton vector having 1 at the $i$-th index and zero elsewhere. It promotes the assignment of a new label to the $i$-th node when $(w_i^{(s)})^{-1} \ll 1$.

To bound the complexity of our incremental framework, and to turn it into an on-line procedure, we consider a sliding window $[t - T_o, t]$ and forget the history of the graph outside the window. The distributions of the nodes that lie outside the window are frozen, and the node-wise optimization, defined in Equation (16), is only considered for the nodes that belong to the window. The window size $T_o$ trades off the tracking accuracy against the computational (and memory) resources.
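
The sliding-window rule amounts to partitioning the nodes by their time stamps before each optimization sweep. The helper below is a small sketch of that bookkeeping; the name `split_active_frozen` and the list-of-times representation are assumptions made for illustration.

```python
def split_active_frozen(node_times, t, T_o):
    """Partition node indices into active and frozen sets at time t.

    node_times[i] is the time instant of node i. Nodes inside the window
    [t - T_o, t] are re-optimized; older nodes keep their label distribution.
    """
    active = [i for i, ti in enumerate(node_times) if t - T_o <= ti <= t]
    frozen = [i for i, ti in enumerate(node_times) if ti < t - T_o]
    return active, frozen
```

Only the indices in the first list would then be visited by the node-wise update, which is what bounds the per-frame cost.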

5 Related work 

This section provides a brief review of the recent and 
related works under the following categories: 

Label propagation in graphs. Propagation of labels in a graph is often used in semi-supervised learning approaches, and a concise survey of recent developments in this field can be found in [26] and references therein. In short, most of these approaches assume that the label of a node is approximated as a linear combination of the labels of its neighbours [31]. In [30], the authors use a mixed label propagation in which they (i) measure the bipolar similarity (e.g., Karl Pearson's correlation coefficient, which lies in the range [-1, 1]) between the samples, and (ii) construct a 'positive' and a 'negative' graph based on the sign of the coefficient. Afterwards, they minimize the ratio between the labeling energies due to the positive and negative graphs. This is done by semi-definite relaxation to assign a binary label to each node of the graph. Our method differs from [30] both in the definition of the graph similarities and in the label propagation method. Specifically, since we use multi-class labels instead of binary labels, and impose that the label distribution at each node should lie on a probability simplex, our problem is difficult to cast into their formalism.

Message passing. Message passing (belief propagation) approaches have been used to label the nodes in a graph in tracking/recognition [6], [32], image completion scenario GD' etc. Each node gathers messages from its neighbors, locally optimizes a problem, and then transmits its message. This approach has been shown to be exact in trees, but convergence is not guaranteed in the presence of loops. In contrast, we do not assume any structure of the graph to guarantee the convergence of our approach.

In [32], a subset of the nodes is initially labelled and then a CRF is used to infer the labels of the remaining nodes. For this, the authors compute various appearance features and assume that the features are always available with similar accuracies. Hence, their approach cannot exploit appearance features that are sporadic or affected by non-stationary noise. In [6], the authors utilized such non-stationary and sporadic features to prioritize the propagation of belief related to the label probability distribution. Even though this approach exploits sporadically available appearance features, it relies on the assumption that the target appearances are known beforehand, which is not the case in our approach.

Mutual exclusion. Mutual exclusion has been considered in [33], [34] to learn discriminative appearance features. In these papers, a low-level but reliable tracker is first used to connect unambiguous detections into tracklets. Afterwards, positive samples are defined by pairs of detections that belong to the same tracklet, while negative samples correspond to pairs that belong to tracklets that likely correspond to distinct objects (because they overlap in time). Lastly, these samples are used to train an AdaBoost classifier [36], which in turn selects the discriminative appearance features. This work is orthogonal to our proposal, since it could help our approach to select the discriminative features while defining the appearance graph(s).

In [9], [10], the authors define a mutual exclusion term based on the physical distance between two detections that occur at the same time. The term goes to infinity as the distance goes to zero. This is motivated by the fact that two objects cannot occupy the same space simultaneously. Our formulation is different in that our mutual exclusion term is defined in terms of the similarity in the label distribution rather than the position.

Distributed proximal optimization: Our label propagation method by node-wise optimization cannot be truly characterized as a distributed computation, but it raises this possibility for future developments. In this context, we note that the authors of [15] devise a proximal optimization on graphs that has quadratic convergence by using Nesterov's method [19]. Whether their approach, which assumes positive graph weights to enforce convex optimization, can be adapted to general weights and DC minimization is a matter of future study.

Laplacian eigenmaps latent variable model (LELVM): LELVM [14] defines an out-of-sample mapping of the Laplacian eigenmaps. Given a graph in which the weight of an edge $x_i \sim x_j$ is constructed as $W_{ij} := \exp(-\|x_i - x_j\|^2/\sigma^2)$, the latent points $Y$ are the solution of

$$\begin{aligned} \text{minimize} \quad & \mathrm{Tr}(Y^\top \mathbf{L} Y) \\ \text{subject to} \quad & Y \in \mathbb{R}^{N \times L}, \quad Y^\top D Y = I, \quad Y^\top D \mathbf{1} = 0, \end{aligned}$$

where $D$ is a diagonal matrix with its $i$-th diagonal element defined as $D_{ii} := \sum_j W_{ij}$, and $\mathbf{L} := D - W$ is the graph Laplacian. When a new sample $x$ arrives, [14] defines an out-of-sample mapping $F(x) = y$ for the new point $x$ as a semi-supervised learning problem, by recomputing the embedding as in the previous equation (i.e., augmenting the graph Laplacian with the new point), but keeping the old embedding fixed. LELVM has been used for tracking human pose in [18]. Our incremental label propagation is similar to LELVM in the sense that we also augment our graph and then solve for the "latent" label distribution. However, LELVM cannot handle newly occurring targets, as it assumes that the new sample $x$ belongs to one of the classes defined by $X$. Moreover, it keeps the old "latent" distributions unchanged, which is not the case in our approach.

6 Evaluation 

The proposed algorithm has been evaluated on 
the following well-known and challenging datasets: 
APIDIS Q, PETS-2009 S2/L1 0, TUD Stadtmitte 
0 and TUD Crossing J4j|. APIDIS is a multi-camera 
sequence acquired during a basketball game, whereas 
the other three are monocular sequences. 

In the remainder of the section, we first describe 
these datasets. We then discuss the evaluation metrics 
and the implementation details. Finally, we present 
our results and compare them with several state-of- 
the-art methods. 

6.1 Datasets 

APIDIS dataset. This 1-minute video dataset is generated by 7 cameras distributed around a basketball court. The candidate detections are computed independently at each time instant based on a ground occupancy map, as described in [35]. For each detection, the jersey color and its digit are computed to define the appearance features. In short, the jersey color is computed as the average blue component divided by the sum of the average red and green components, over the foreground silhouette of the player within the detected rectangular box. The digit feature is obtained by running a digit-recognition algorithm [39] in the same rectangular region. The digit feature is inherently sporadic, as it is available only when the digit on the jersey faces the camera.

Pedestrian datasets. To evaluate the performance of our method in monocular views, we use the publicly available PETS-2009 S2/L1, TUD Stadtmitte and TUD Crossing datasets. The PETS dataset is 795 frames long, with moderate target density. However, the pedestrians wear similar dark clothes, which makes appearance comparison very challenging. TUD Stadtmitte and TUD Crossing are 179 and 201 frames long respectively, but the targets frequently occlude each other because of the low viewpoint. Detection results and the ground truth are obtained from [5]. Afterwards, 8-bin CIE-LAB color histograms are computed for each channel of each bounding box, resulting in a 24-bin appearance vector. We ignore the histogram(s) if the overlap ratio between any two bounding boxes exceeds 5%. This is done because the histograms are less likely to represent the target color correctly in case of overlap, and might thus lead to wrong associations between the detections. Since the histogram feature is then no longer available for every detection, it becomes sporadic.
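
The rule that discards histograms of overlapping boxes can be sketched as follows. The text does not define the overlap ratio precisely, so this sketch assumes intersection-over-union between axis-aligned boxes given as `(x1, y1, x2, y2)`; the names `overlap_ratio` and `usable_histograms` are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2); an assumed definition."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def usable_histograms(boxes, max_overlap=0.05):
    """Flag, for each box of one frame, whether its color histogram is kept."""
    usable = [True] * len(boxes)
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlap_ratio(boxes[i], boxes[j]) > max_overlap:
                # Both histograms are likely contaminated by the other target.
                usable[i] = usable[j] = False
    return usable
```

Detections flagged `False` simply have no appearance feature, which is what makes the feature sporadic in the appearance graph.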


6.2 Evaluation metrics 

We use the CLEAR MOT metrics [13] to evaluate our approach. They define two quantities, namely the multiple object tracking precision (MOTP) and the multiple object tracking accuracy (MOTA).

MOTP is defined as the total error in estimated position for matched$^7$ ground-truth and track pairs over all frames, averaged by the total number of matches. MOTA measures the number of misses, false positives, re-initializations and identity switches. A miss means that the tracker does not have a matching estimate for a ground-truth object. Similarly, a tracker output is called a false positive when no matching ground truth is available. A switching error occurs when the tracker starts following another object, whereas a re-initialization error occurs when the tracker fails to track the object at some time and a new track is assigned to the same object later. The error due to switching is more problematic, as it might lead to significant errors in higher-level interpretation.
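
For reference, the two scores can be computed from accumulated counts as in the sketch below. It follows the usual CLEAR MOT definitions, and it assumes that switches and re-initializations are both accumulated in a single `mismatches` count; the function names are illustrative.

```python
def mota(misses, false_positives, mismatches, num_gt):
    """MOTA = 1 - (misses + false positives + mismatches) / ground-truth objects.

    All counts are summed over every frame of the sequence.
    """
    return 1.0 - (misses + false_positives + mismatches) / num_gt


def motp(total_match_error, num_matches):
    """MOTP = total position error over matched pairs / number of matches."""
    return total_match_error / num_matches
```

For example, 5 misses, 3 false positives and 2 mismatches over 100 ground-truth objects give a MOTA of 0.9.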

MOTA is often preferred over MOTP because MOTP depends on the accuracy of the target detector and on the accuracy of the ground-truth annotations. In our tables, the number of switching errors (SW) is also reported, due to its importance regarding long-term tracking capabilities.

6.3 Implementation details 

Both the joint and node-wise label propagation algorithms have been implemented in MATLAB, running on a 2.4 GHz quad-core CPU with 4 GB RAM. The parallel implementation of the node-wise label propagation has been done separately in C++ using the Boost Graph Library and OpenMP.

Pedestrian datasets. For these datasets, a node is assigned to each individual detection. The size of the temporal neighborhood in the spatio-temporal graph is chosen to be 10 frames, i.e., $T = 10$. When processing time is an issue, we can envision processing the dataset in batches or running a low-level but reliable tracker first to reduce the complexity (as we do for the APIDIS dataset).

7. A tracker output and the ground truth are defined to be matched if their intersection-over-union ratio exceeds 50% (respectively, if their distance is less than 30 cm for APIDIS; the threshold value of 30 cm is recommended for the APIDIS dataset).

APIDIS dataset. We first pre-process the data by aggregating some of the detections into tracklets based on a spatio-temporally local but reliable tracker. This tracker associates two detections between successive frames into a tracklet when they are separated by less than 15 cm and there is no other detection that is closer than 15 cm to either of them. The resulting tracklets define the nodes in our graphs. The neighborhood of the spatio-temporal graph is defined to connect the tracklets within 100 frames on each side, which allows us to connect tracklets that are up to 4 seconds apart. In the exclusion graph, the neighborhood of a node consists of all the nodes that overlap in time. Finally, the appearance features of a tracklet are inferred by averaging the appearance features of the detections along the tracklet.
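
The association rule of the local tracker can be sketched as below. This is one possible reading of the rule, assuming ground-plane positions in metres and interpreting "no other detection closer than 15 cm" as applying to all other detections of both frames; the name `link_unambiguous` is hypothetical.

```python
import math


def link_unambiguous(prev, curr, radius=0.15):
    """Link detections of two successive frames when the match is unambiguous.

    prev and curr are lists of (x, y) ground positions in metres. A pair is
    linked only if it is closer than radius and no other detection of either
    frame lies within radius of one of its two members.
    """
    links = []
    for i, p in enumerate(prev):
        for j, c in enumerate(curr):
            if math.dist(p, c) >= radius:
                continue
            others = [q for k, q in enumerate(prev) if k != i]
            others += [q for k, q in enumerate(curr) if k != j]
            if all(math.dist(p, q) >= radius and math.dist(c, q) >= radius
                   for q in others):
                links.append((i, j))
    return links
```

Ambiguous situations are deliberately left unlinked, so that the decision is deferred to the graph-based label propagation.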

Post processing. Once the label propagation step is over, we filter out the tracks that satisfy one of the following criteria:

• the track contains fewer than 10 detections,

• the track is primarily composed of low-confidence detections, i.e., the maximum confidence value along the track is less than 0.8.

The reasons behind these heuristics are that false tracks are usually shorter than regular target tracks and that false positive detections have lower confidence values than true detections. This case is prevalent in the PETS and both TUD datasets.
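
The two post-processing criteria translate into a simple predicate on the detection confidences of a track. The sketch below uses the thresholds quoted above as defaults; the function name `keep_track` and the list-of-confidences representation are assumptions.

```python
def keep_track(confidences, min_length=10, min_peak_confidence=0.8):
    """Return True if a track survives the two post-processing criteria.

    confidences holds the detector confidence of each detection on the track.
    A track is dropped if it is too short or if even its most confident
    detection falls below the confidence threshold.
    """
    if len(confidences) < min_length:
        return False
    return max(confidences) >= min_peak_confidence
```

A track is kept as soon as it is long enough and contains at least one confident detection, which matches the "maximum confidence" wording of the second criterion.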


A glimpse of running times is presented below:

| Dataset | Low-level tracker | Graph construction | Label propagation (joint) | Label propagation (node-wise) |
|---|---|---|---|---|
| TUD Stadtmitte | - | 2 min | 3 min | 25 sec |
| TUD Crossing | - | 155 sec | 167 sec | 31 sec |
| PETS | - | 3 min | 40 min | 5 min |
| APIDIS | 15 sec | 1 min | 5 min | 1 min |



| Dataset | Method | MOTA | MOTP | SW |
|---|---|---|---|---|
| TUD Stadtmitte | Continuous energy [9] | 60.5 | 65.8 | 7 |
| TUD Stadtmitte | Discrete-continuous (D-C) [10] | 61.8 | 63.2 | 4 |
| TUD Stadtmitte | GMCP tracker [23] | 77.7 | 63.4 | 0 |
| TUD Stadtmitte | Joint (no appearance) | 62.7 | 73.5 | 17 |
| TUD Stadtmitte | Joint (with appearance) | 79.2 | 73.9 | 4 |
| TUD Stadtmitte | Node-wise (no appearance) | 63.1 | 73.6 | 16 |
| TUD Stadtmitte | Node-wise (with appearance) | 79.5 | 73.9 | 4 |
| TUD Crossing | Discrete-continuous (D-C) [10] | 57.3 | 73.7 | 13 |
| TUD Crossing | Continuous energy [9] | 61.6 | 73.2 | 28 |
| TUD Crossing | GMCP tracker [23] | 91.63 | 75.6 | 0 |
| TUD Crossing | Joint (no appearance) | 62.5 | 74.3 | 12 |
| TUD Crossing | Joint (with appearance) | 65.4 | 75.4 | 8 |
| TUD Crossing | Node-wise (no appearance) | 62.3 | 74.3 | 13 |
| TUD Crossing | Node-wise (with appearance) | 65.4 | 75.2 | 8 |
| PETS | Discrete-continuous (D-C) [10] | 89.30 | 56.40 | - |
| PETS | Continuous energy [9] | 81.84 | 73.93 | 15 |
| PETS | K-shortest paths [12] | 80.00 | 58.00 | 28 |
| PETS | GMCP tracker [23] | 90.30 | 69.02 | 8 |
| PETS | Global appearance (GA) [21] | 81.46 | 58.38 | 19 |
| PETS | Iterative hypothesis (IH) [8] | 83.0 | 74.0 | N/A |
| PETS | Joint (no appearance) | 82.77 | 71.21 | 26 |
| PETS | Joint (with appearance) | 91.04 | 70.99 | 5 |
| PETS | Node-wise (no appearance) | 83.07 | 71.23 | 25 |
| PETS | Node-wise (with appearance) | 91.04 | 71.00 | 5 |

TABLE 2

Tracking results on the TUD Stadtmitte (179 frames), TUD Crossing (201 frames) and PETS 2009-S2/L1 (795 frames) datasets. The D-C, IH, GMCP, KSP and GA results are obtained from [8], [10], [21], [23].


6.4 Results 


In this section, we first present the tracking results for our framework applied to offline-constructed graphs. Then, we present the tracking results for the incremental graph construction and label propagation. The computational advantages due to the node-wise decomposition and parallelization are presented afterwards. Then, the effects of the parameters are discussed. Lastly, some qualitative results are presented.


6.4.1 Tracking results for offline-constructed graphs

To better compare with the literature, we consider two versions of the method. The first one uses only the spatio-temporal information. Thus, we construct only the spatio-temporal and the exclusion graphs. This is equivalent to setting $\alpha_0 = 1$ and $\alpha_p = 0, \forall p \neq 0$ in our algorithm. In contrast, the second one considers both the spatio-temporal and the appearance features. For the TUD Stadtmitte, TUD Crossing and PETS datasets, we use $\alpha_0 > \alpha_1$ ($\alpha_0$ for the spatio-temporal graph and $\alpha_1$ for the appearance graph). This constrains the spatio-temporal consistency more strictly than the appearance consistency. The reason is that the targets wear similar clothes and therefore have similar appearances in these datasets. In the experiments, we use $\alpha_0 = 1$ and $\alpha_1 = 0.5$.$^8$

We compare our results with several methods, namely the continuous energy (CE) minimization [9], the discrete-continuous (D-C) minimization [10], the GMCP tracker [23], the K-shortest paths (KSP) [12], the global appearance constraints (GA) [21] and the iterative hypothesis testing (IH) [8]. The CE and D-C trackers estimate the most probable trajectories by minimizing energies that consist of a combination of observation energy, dynamic energy, mutual exclusion energy, track persistence energy, etc. In addition, the D-C tracker uses cubic splines for modeling the motion of the target, and favors the reduction of the number of trajectories. GMCP greedily solves a generalized minimum clique problem to extract tracklets that have the most stable appearance features and the most consistent motion. KSP solves a network-flow formulation of the tracking problem and minimizes the sum of pairwise association costs between consecutive detections to estimate K tracks. GA improves KSP by incorporating appearance information. IH embeds a hypothesis testing strategy into a greedy shortest-path computation procedure to exploit the appearance features that are unreliable and/or sporadically available. Since the CE, D-C and KSP trackers do not use appearance information, we compare them with the first version of our approach, which does not use appearance features. Similarly, since GA, IH and GMCP exploit the appearance features, we compare them to the second version of our approach.

8. We varied $\alpha_1 \in [0.1, 1]$ but did not observe significant performance changes.

In Table 2, we first observe that the joint and node-wise label optimization approaches give similar performances. For the TUD Stadtmitte dataset, our method is better than previous methods both in terms of MOTP and MOTA. This is because our approach is able to connect detections even if they are far apart in time, resulting in longer and more consistent tracks. However, our method is slightly worse than GMCP in terms of ID switches. This might be because GMCP uses motion information in a global manner to ensure a smooth displacement while connecting the tracklets, which is not the case in our formalism.

In the case of the TUD Crossing dataset, our method outperforms CE and D-C. Surprisingly, GMCP has reported outstanding results. The GMCP paper does not describe how the detections have been obtained. Our methods use the same detections as CE and D-C, which have been obtained from the MOTChallenge [4]. We have observed that removing the unreliable (confidence < 0.6) detections from the MOTChallenge set already reduces the MOTA of an oracle tracker to around 0.7, while keeping all the detections introduces many false positives. Hence, even though we were unable to run the GMCP code to verify it, we suspect that the results reported by GMCP are based on a better set of detections than the MOTChallenge ones. This is explained in detail in the supplementary material.

In the case of the PETS dataset, we again observe that our proposed approach outperforms most approaches. When the appearance features are ignored, the MOTA is better than that of KSP but worse than that of D-C. This might be because D-C exploits higher-order motion models, whereas our formalism does not. We attribute our superior performance against KSP and GA, which do not take motion information into account, to the linear motion model that is implicit in our formalism. When the appearance information is incorporated, the MOTA improves significantly, from 82% to 91%. Moreover, the switching error is drastically reduced.

The results for the APIDIS dataset are presented in Table 3. Since GA and IH are the only methods from the literature that are able to exploit sporadic appearance features, we focus the comparison on them. As before, we first computed the results without using any appearance feature. This is done by setting $\alpha_0 = 1, \alpha_1 = 0, \alpha_2 = 0$, where the indices 0, 1 and 2 correspond to the spatio-temporal, the color and the digit graphs respectively. Afterwards, we use both the digit and the color features. As the color feature is less discriminant than the digit feature (because the players from the same team wear jerseys of the same color), we set $\alpha_1 < \alpha_2$. Empirically, we use $\alpha_0 = 1, \alpha_1 = 0.1, \alpha_2 = 0.5$.

Although our approach performs significantly better than GA, the results are slightly worse than those of IH. We see two potential reasons for this. First, our graph construction method assumes that the features are always reliable (whenever they are present). This is not the case for IH, which takes into account the confidence of the feature measurement while connecting two nodes. Doing so, it lowers the impact of noisy appearance features as compared to the reliable ones. Second, IH associates two nodes only when the connection is sufficiently more reliable than the alternative connections. This prevents potential track switches, as reflected by the switching errors.


| Method | MOTA | MOTP | SW |
|---|---|---|---|
| IH (no appearance) [8] | 85.83 | 60.83 | 18 |
| IH (color+digit) [8] | 86.19 | 60.90 | 12 |
| GA (no appearance) [21]* | 72.91 | 53.13 | 108 |
| GA (color+digit) [21] | 73.07 | 53.15 | 110 |
| Joint (no appearance) | 81.27 | 57.13 | 49 |
| Joint (color+digit) | 83.90 | 60.04 | 45 |
| Node-wise (no appearance) | 81.4 | 57.17 | 49 |
| Node-wise (color+digit) | 83.85 | 60.01 | 45 |

TABLE 3

Results on the APIDIS dataset (1500 frames). The tracking results for IH and GA have been provided by the authors. [*] Since the detection results for [21] are different from those for [8] and ours, we relax the distance threshold to 40 cm (from 30 cm) for the tracking results of [21]. Detailed results are provided in the supplementary material.

6.4.2 Tracking results for incrementally constructed 
graphs 

We constructed the graph as described in Section 4.1 and performed incremental label propagation. The construction of the graph in the case of the APIDIS dataset is slightly different from that of the other datasets. In this case, if new detections can be unambiguously matched to the existing nodes, they are aggregated into a single tracklet. Otherwise, we create new nodes for the detections and connect them with existing nodes. The tracking results are presented in Table 4. We observe that the tracking accuracy of the incremental approach is slightly worse than that of the off-line method. This reveals the importance of embedding a linear motion model during graph construction.


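| Dataset | Appearance feature | MOTA | MOTP | SW |
|---|---|---|---|---|
| PETS | No | 79.32 | 70.70 | 26 |
| PETS | Yes | 86.56 | 71.40 | 6 |
| TUD Stadtmitte | No | 61.60 | 73.30 | 13 |
| TUD Stadtmitte | Yes | 77.20 | 73.40 | 2 |
| TUD Crossing | No | 61.2 | 72.1 | 19 |
| TUD Crossing | Yes | 63.4 | 72.3 | 12 |
| APIDIS | No | 74.40 | 54.20 | 52 |
| APIDIS | Yes (color+digit) | 80.23 | 58.45 | 47 |

TABLE 4

Results of the incremental graph construction and label propagation approach.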


To trade off the complexity against the quality of the incremental solution, we considered only the nodes that lie within the observation window $[t - T_o, t]$ to perform label propagation. The rest of the nodes were frozen, meaning that the node-wise optimization was not performed on those nodes. The results are shown in Figure 2 for the TUD Stadtmitte dataset. As we can see, the processing time increases monotonically with the size of the observation window. However, the tracking accuracy improves only up to some value (50 frames in Figure 2), after which it saturates. Alternatively, one could define other heuristics to freeze the nodes. For example, one could decide to freeze a node if the change in its label distribution over time is smaller than some pre-defined threshold.

Fig. 2. Trade-off between the processing time and the tracking accuracy for different observation window sizes for the TUD Stadtmitte dataset.

6.4.3 Computational advantages of the node-wise 
decomposition and parallelization 
To study the effect of the node-wise decomposition, we constructed the graph off-line with different numbers of frames. Once the graph was constructed, we used both the joint and the node-wise approaches for label propagation with 10 random initializations. Afterwards, we computed the processing times needed by both approaches to reach the same labeling energy (equal to the labeling energy of the joint optimization after convergence). The results are shown in Figure 3. We can see a dramatic improvement in computational speed, especially when the size of the graph increases. We observed that one iteration (over the whole graph) of the node-wise label optimization reduces the labeling energy much faster than one iteration of the joint optimization.




Fig. 3. Processing times for the joint and the node-wise 
approaches for different size of the graph. 


To assess the advantages offered by the parallel implementation, we consider a simple scheduling strategy, which directly follows the non-interference condition (see Section 3.2.2) and selects the nodes at random. For each number of processors, we ran the algorithm 10 times and recorded the evolution of the objective function. The results are depicted in Figure 4. The reported times differ from those of Figure 3 because the parallel implementation is done in C++. Although the parallel implementation decreases the computational time, we observe that the reduction is not proportional to the degree of parallelism. This sub-optimal speed-up factor is due to the fact that we run the algorithm in batches of nodes. As a consequence, the time required to process a batch is governed by the longest time taken by one of its nodes. The node-selection algorithm and the distribution of the time taken by the nodes in a batch are presented in the supplementary material.
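
The effect of batching on the speed-up can be illustrated with a toy model. The sketch below assumes that nodes are processed in consecutive batches of one node per processor and that a batch lasts as long as its slowest node; it ignores the random, non-interfering node selection of the actual scheduler, and the name `batched_speedup` is hypothetical.

```python
def batched_speedup(node_times, num_procs):
    """Speed-up of batched parallel processing over sequential processing.

    node_times lists the per-node solve times. Nodes are grouped into
    consecutive batches of num_procs nodes, and each batch takes as long as
    its slowest node.
    """
    sequential = sum(node_times)
    parallel = sum(max(node_times[i:i + num_procs])
                   for i in range(0, len(node_times), num_procs))
    return sequential / parallel
```

With identical node times the speed-up equals the number of processors, whereas a single slow node in a batch pulls it below that ideal value.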



Fig. 4. Processing time and speed-up factors for different numbers of processors (P = 1 to P = 4) on the TUD Stadtmitte (top row) and PETS (bottom row) datasets. For each case, we perform 10 runs of the algorithm, which are drawn with the same color.

6.4.4 Effect of parameters 

Our algorithm has some key parameters. They are listed in Table 1. The effects of $\alpha_i$ and $T_o$ have already been discussed in Section 6.4.1 and Section 6.4.2. In this section, we consider $T$, $T_c$, $\gamma$ and $v_{\max}$ and discuss their effects on the performance. For this, only one parameter is changed at a time and all other parameters are fixed at their reference values. Figure 5 presents our results. In all graphs, the blue and green curves depict the MOTA and the computational time respectively. In the first column, which considers the incremental algorithm, this computational time reflects both the graph construction and the label propagation, since they occur jointly all along the process. In the three last columns, which refer to the off-line algorithm, the green curve measures the graph construction time only.

From Figure 5, we observe that increasing $T_c$ increases the computation time. However, the MOTA is improved only up to some value (100 frames in our experiments), after which it starts decreasing. This is mainly due to the fact that the chances of wrong associations increase with large $T_c$.

Since the parameters T, γ and v_max do not affect
the construction of the appearance graph, we report
the time taken for the spatio-temporal graph only. We
observe that increasing T increases the connectivity of
the graph (which leads to increased time to construct
the graph). We observe that the MOTA increases up
to a certain value of T and then starts decreasing again.
On the one hand, when T is small, it might not be
effective to bridge the local missed detections. On
the other hand, a large T is not only more prone
to wrong connections but also might not satisfy the
linear motion model assumption. Interestingly, γ does
not seem to affect MOTA much. From Figure 5,
we also observe that γ does not affect the graph
construction time when a small window T = 10 is
considered. However, we have observed that its effect
is significant when T increases. As an example, the
graph construction time for γ = 1 is around 10 times
[Figure 5 panels. First column: incremental graph construction and label propagation. Last three columns: off-line graph construction.]



Fig. 5. Effect of parameters on TUD Stadtmitte (top) and PETS (bottom) datasets. Each plot shows the 
effect of changing a single parameter while keeping other parameters fixed. Red squares correspond to the 
reference parameter values used in our experiments. The parameter is mentioned on the bottom right corner. 


more than that for γ = 7 when T is set to 100 frames
for the TUD Stadtmitte dataset (see footnote 9). This is because a large
γ reinforces the implicit linear motion assumption
embedded in Equation (JTjl, which in turn restricts 
the number of neighboring nodes that remain eligible
for non-zero weights, leading to a sharp reduction in
the graph construction time. Finally, reducing v_max
typically reduces the time to construct the graph, as
it discards from the neighborhood many detections
that violate the gating constraint. On the flip side,
these detections receive non-zero weights in the exclusion
graph and therefore receive different labels, resulting
in reduced MOTA when v_max becomes too small,
i.e., typically below the reference value of 10. When
v_max increases beyond the reference point (in red), it
increases the chances of wrong associations, resulting
in lower MOTA.
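The interplay between the temporal window T and the gating threshold v_max can be sketched as a neighborhood filter. This is only an illustration: the gating test used here (displacement divided by frame gap must not exceed v_max) and the exclusion of same-frame detections are assumptions, and the function name and detection format are hypothetical.

```python
import math
from typing import List, Tuple

# A detection is (frame index, x, y); this format is assumed for the sketch.
Detection = Tuple[int, float, float]


def eligible_neighbors(
    detections: List[Detection], i: int, T: int, v_max: float
) -> List[int]:
    """Indices of detections that may receive a non-zero weight for node i.

    A candidate must lie within T frames of node i and must be reachable
    at a speed of at most v_max (distance units per frame).
    """
    t_i, x_i, y_i = detections[i]
    neighbors = []
    for j, (t_j, x_j, y_j) in enumerate(detections):
        dt = abs(t_j - t_i)
        if j == i or dt == 0 or dt > T:
            continue  # outside the temporal window (or same frame)
        speed = math.hypot(x_j - x_i, y_j - y_i) / dt
        if speed <= v_max:
            neighbors.append(j)
    return neighbors


# Hypothetical detections: a close one, a far one, and a temporally distant one.
dets = [(0, 0.0, 0.0), (1, 5.0, 0.0), (1, 50.0, 0.0), (20, 10.0, 0.0)]
print(eligible_neighbors(dets, 0, T=10, v_max=10.0))
print(eligible_neighbors(dets, 0, T=100, v_max=10.0))
```

Enlarging T or v_max admits more candidates per node, which matches the observation that both increase graph connectivity and construction time.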

6.4.5 Qualitative results 

Now, we present some qualitative results. Figure 6
depicts the detections, constructed graphs and the 
inferred labels. Due to lack of space, we present the 
sample frames and discuss the failure cases in the 
supplementary material. 

7 Conclusion and future works 

In this paper, we have focused on the multi-object
tracking (MOT) problem under sporadic appearance
features. For this purpose, a number of complementary
graphs have been constructed to capture the
spatio-temporal and the appearance information.
Afterwards, MOT has been formulated as a consistent
labeling problem in the associated graphs. The proposed
solution is based on difference of convex programming,
for which we have provided both the joint
and the node-wise label optimization solutions. We
have shown that node-wise label propagation allows us to
scale up the algorithm with the number of nodes. Two
further extensions of the proposed approach have
been investigated. First, we have proposed a parallel
implementation of the node-wise label propagation.
Second, the node-wise decomposition has been embedded
in an incremental graph construction step.

9. This observation is not reported in Figure 5.


Interesting paths to investigate in future research
include extending the framework to embed
higher-order motion models in the spatio-temporal
graph construction, and to handle the range of feature
confidence levels in a continuous manner. This
would be in contrast with our current approach,
which turns the variable reliability of the features into
sporadic measurements through hard thresholding.
Fig. 6. Sample graphs and label evolution on a subset of detections from the PETS dataset. Top row shows
the input detections and the three constructed graphs. For clarity, edges that have weights smaller than 10^-2 are
suppressed. Bottom row depicts the evolution of the node labels along with the corresponding labeling energy.